feat: wire validated-but-unused config (feature 965 backlog)#23
Conversation
…ask #15214) Add RemediatorRegistry.RemediateWithStrategies(ctx, []string, problem): tries strategy types in order, stops on first success, returns wrapped last error if all fail; reuses Remediate so circuit-breaker/rate-limit/ cooldown/dry-run all apply. detector.evaluateRemediation builds the ordered list from remCfg.Strategies (each nested .Strategy) when present, else falls back to [remCfg.Strategy] (backward compatible). Tests cover first-fail-then-success ordering, short-circuit, all-fail, and fallback. Note: real config types are Strategy string + Strategies []MonitorRemediationConfig (task premise said []string); per-strategy params don't flow through Remediate today, so dispatch is by type string.
…k #15215) Replace the InjectProblemPattern no-op stub: store injected patterns (mutex-guarded, name-keyed/idempotent) and feed them into detectCommonCauseCorrelation alongside built-ins (deduped by name) via the same findNodesWithAllProblems/MinNodesForCorrelation/confidence logic. Signature now returns error (no existing callers): empty name/ problems rejected. Add correlator_test.go (first tests for the package): detection thresholds, confidence, GetStats, injection validation, and injected-pattern participation. Race-clean. Note: task referenced GetCorrelationSummary/AddPattern which don't exist; real equivalents are GetStats/GetActiveCorrelations.
…k #15216) MaxRemediationsPerMinute was parsed/defaulted(2)/validated but never enforced. Add a perMinuteBucket *rate.Limiter (golang.org/x/time/rate) on RemediatorRegistry via SetMaxRemediationsPerMinute (matching the Set* idiom; 0 = unlimited), wired in main from config. Enforced in Remediate AFTER the non-consuming per-hour window check so a per-hour rejection never burns a per-minute token and a rejected call isn't counted as executed. Updated the stale config comment. Tests: N succeed / N+1 rejected within a minute, and unlimited at 0. go mod tidy promoted x/time (+modernc/sqlite) to direct, dropped unused gogo/protobuf.
New pkg/logger.Init(cfg) builds a slog JSON/text handler at the configured level writing to stdout/stderr/file (consuming LogOutput/LogFile, which were validated but unused), sets slog default, and bridges the stdlib log package via an io.Writer that strips [LEVEL] prefixes -> slog levels — so all 374 existing log.Printf sites honor the configured format/destination with zero churn. Wired in main after config validation (warn-and-continue on error). Converted 7 detector hot paths to native structured slog with attributes. Updated stale config comment. Tests: JSON parseable, text, file output, prefix-bridge level mapping, level filtering. Also poll-for-stat fix to 3 pre-existing flaky remediation tests exposed by the slog timing change (production logic unchanged).
…nings (Task #15218) Add lease_client_test.go covering RequestLease: reachable+grants, reachable+denies, unreachable+fallback=true (FAKE approval — test documents the cluster-wide remediation storm this enables), and unreachable+fallback=false (error). Add prominent storm-risk WARNING above the FallbackOnUnreachable config field and in the helm values comment (value stays false). Committed only my hunks (pre-existing image.tag bump excluded).
…219) Add a test-ci target that runs the fast unit suite (-short) AND the integration suite (go test ./test/integration/... no -short, 10m timeout) so coverage gaps like TestLeaseCoordinationFlow and TestCorrelationDetectionFlow (test/integration/controller/) are exercised in one gate. Build-tagged integration tests (kind dual-stack) still need -tags=integration and are not run here. Added to .PHONY.
Apply PRAGMA journal_mode=WAL and synchronous=NORMAL in SQLiteStorage. Initialize (after ping) so concurrent RequestLease reads don't block the single writer and writes are faster while staying crash-safe in WAL mode. Connection pool was already pinned (SetMaxOpenConns/IdleConns(1), ConnMaxLifetime(0)). Add TestSQLiteStorage_WALAndSynchronous (file-backed temp DB) asserting journal_mode=wal, synchronous=1 (NORMAL), and MaxOpenConnections=1.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce2e6ff2fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !strings.HasPrefix(trimmed, "[") { | ||
| return slog.LevelInfo, line |
There was a problem hiding this comment.
Preserve fatal log messages at high log levels
No-prefix standard-log writes are downgraded to INFO here, but several post-logger.Init fatal paths in cmd/node-doctor/main.go use log.Fatalf("Failed ...") without a [FATAL]/[ERROR] prefix. When operators configure logLevel: warn, error, or fatal, slog filters those INFO records and the daemon can exit without emitting the startup failure reason.
Useful? React with 👍 / 👎.
| remediationErr = err | ||
| return err | ||
| } | ||
| if err := r.checkPerMinuteRate(); err != nil { |
There was a problem hiding this comment.
Consume minute tokens only for executable remediations
The per-minute limiter consumes a token before controller lease acquisition and before resolving/checking the remediator, so non-executed attempts such as a lease denial, unreachable controller, unknown strategy, or CanRemediate cooldown still drain the bucket. With maxRemediationsPerMinute: 1, one denied lease can block the next actually-approved remediation for up to a minute even though no remediation ran.
Useful? React with 👍 / 👎.
…rategy config (Task #19263 phase 1) Built-in remediators were never registered, so registry.Remediate failed for every strategy. Add RegisterBuiltinRemediators (wired in main) for the two SAFE strategies: systemd-restart->SystemdRemediator, custom-script-> CustomRemediator. Thread per-strategy params via Problem.Metadata (service/scriptPath/args): buildStrategyList now returns the ordered []MonitorRemediationConfig and the detector loops them (first-success-wins preserved), encoding each strategy's params into the dispatched Problem. Systemd/Custom remediators read params from metadata with config fallback; CustomRemediator keeps abs-path/no-.. validation on the metadata path. node-reboot/pod-delete intentionally NOT registered (destructive; phase 2). Tests: registration, per-strategy metadata threading, dry-run dispatch, scriptPath validation. detector race-clean.
… #19263 phase 2) Add RegisterClusterRemediators(registry,cfg,client,nodeName,self...) wiring the two destructive strategies, fail-closed: registers node-reboot + pod-delete ONLY when a k8s client and nodeName are present (else logs and registers nothing). main builds an in-cluster clientset (non-fatal on error) + passes downward-API POD_NAME/POD_NAMESPACE for self-skip. NodeRebootRemediator: cordon -> drain (evict, skipping DaemonSet/mirror/ self pods) -> reboot via injected runner (default systemctl reboot); dry-run returns before ANY mutation; cordon failure aborts reboot; bounded per-phase timeouts; 30m cooldown. PodDeleteRemediator: deletes only the pod named in Problem.Metadata (namespace/pod), refuses mirror/self, dry-run log-only. Tested with fake clientset incl. dry-run-does-not-execute, drain ordering/skips, and fail-closed registration.
…222 (Task #19263 phase 3) Add flush-dns/restart-interface/reset-routing/flush-ipv6-route to validRemediationStrategies (+ restart-interface requires an interface) and register all four in RegisterBuiltinRemediators as NetworkRemediator closures (per-op). flush-ipv6-route is now registered + reachable, closing #17222. NetworkRemediator resolves the target interface from Problem.Metadata["interface"] (new MonitorRemediationConfig.Interface field threaded by the detector) with config fallback; AllowMetadataInterface lets the registered restart-interface singleton construct without a fixed NIC. Tests: 6 safe strategies registered, dry-run + executor dispatch of flush-ipv6-route, interface-from-metadata, missing-interface error.
Remove the freePort helper (bind :0 -> read port -> close -> rebind race). The exporter now captures its real bound address in Start (from the eagerly-bound listener) and exposes BoundAddr()/BoundPort(); tests construct an ephemeral-port exporter (Port 0, 9100 default applied only on the production path) and read the actual port back instead of pre-allocating. Migrated all freePort call sites. Verified: package -count=20 and -race -count=5 pass with zero bind flakes; full -short suite stable.
Drains the feature-965 "config validated but never wired" backlog PLUS the remediation-wiring epic discovered mid-work (TaskForge project 41). Stacked on #22 (
feat/ipv6-dualstack) — review/merge that first; base retargets tomainonce #22 lands. Fullgo test ./... -shortgreen (first try; the long-standing freePort flake is fixed here).Feature 965 (7 tasks)
InjectProblemPatternstub (was a no-op) + first package tests.MaxRemediationsPerMinute(x/time/ratetoken bucket).pkg/logger+[LEVEL]prefix bridge; zero-churn over 374 sites).FallbackOnUnreachablelease tests + cluster-wide remediation-storm warnings.test-ciMakefile gate (unit + integration).synchronous=NORMALfor controller storage.Remediation wiring epic — #19263 (high) + closes #17222
Discovered that no built-in remediators were registered at all (
Remediate(strategy)failed for every type). Fixed across 3 phases:RegisterBuiltinRemediatorsfor safe strategies (systemd-restart, custom-script); per-strategy params threaded viaProblem.Metadata; remediators read with config fallback.RegisterClusterRemediators(fail-closed without a k8s client) + new node-reboot (cordon→drain→reboot, dry-run-safe, skips DaemonSet/mirror/self, injected runner) and pod-delete (metadata-targeted, guarded) remediators; in-cluster clientset built in main.All 8 remediation strategies are now registered and reachable.
Test infra
BoundPort(); tests bind ephemeral + read back). Verified-count=20+-race.Per project policy, not auto-merged.